What is data wrangling

It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data (Dasu and Johnson 2003).

At its heart, data wrangling is the set of techniques to get data into a usable form for analysis and visualization.

Data wrangling is a catchall term for data importing, data cleaning, and data transformation, all in service of creating some product with the finished data.

Tidy data

All messy data is unhappy in its own way. To put some structure on data, we prefer that our data is tidy. In tidy data, every column is a variable, every row is an observation, and every cell is a single value (Wickham 2014).

Tidy datasets provide a standardized way to link the physical appearance of a dataset with its meaning (Wickham 2014). Datasets are made of rows and columns that contain a collection of values.

Challenge 1

Here is a dataset inspired by an Economist/YouGov Poll about the importance of unemployment in the United States.

challenge_1 <- tibble(
  issue_importance = c("Very Important", "Somewhat Important",
                       "Not very Important", "Unimportant"),
  `18-29` = c(.59, .28, .08, .05),
  `30-44` = c(.66, .27, .06, .02),
  `45-64` = c(.69, .28, .03, .01),
  `65+` = c(.7, .27, .02, .01)
)

head(challenge_1)
## # A tibble: 4 × 5
##   issue_importance   `18-29` `30-44` `45-64` `65+`
##   <chr>                <dbl>   <dbl>   <dbl> <dbl>
## 1 Very Important        0.59    0.66    0.69  0.7 
## 2 Somewhat Important    0.28    0.27    0.28  0.27
## 3 Not very Important    0.08    0.06    0.03  0.02
## 4 Unimportant           0.05    0.02    0.01  0.01

Answer the following:

  1. How many observations are in the data set?

  2. How many columns are in the data set?

  3. Is every column a variable?

Challenge 2

Consider the data below. Both data sets display information on heart rate observed in individuals across three different time periods. But the data are organized differently in each table. Which one of these do you think is the tidy format?

wide <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory"),
  time1 = c(67, 80, 64),
  time2 = c(56, 90, 50),
  time3 = c(70, 67, 101)
)
wide
##      name time1 time2 time3
## 1  Wilbur    67    56    70
## 2 Petunia    80    90    67
## 3 Gregory    64    50   101
long <- data.frame(
  name = c("Wilbur", "Petunia", "Gregory", "Wilbur", "Petunia", "Gregory", "Wilbur", "Petunia", "Gregory"),
  time = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  heartrate = c(67, 80, 64, 56, 90, 50, 70, 67, 101)
)
long
##      name time heartrate
## 1  Wilbur    1        67
## 2 Petunia    1        80
## 3 Gregory    1        64
## 4  Wilbur    2        56
## 5 Petunia    2        90
## 6 Gregory    2        50
## 7  Wilbur    3        70
## 8 Petunia    3        67
## 9 Gregory    3       101

We often refer to these two opposite structures as “long” vs. “wide” formats. In the “long” format, you usually have 1 column for the observed variable(s), and the other columns are ID variables.

For the “wide” format, each row is often a site/subject/patient, and you have multiple observation variables containing the same type of data.
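Converting between the two shapes is a short tidyr (part of the tidyverse) operation; pivot_longer() gets fuller treatment at the end of this workshop, but here is a sketch of how the wide table above becomes the long one (the wide data frame is re-created so the chunk is self-contained):

```r
library(tidyverse)

wide <- data.frame(
  name  = c("Wilbur", "Petunia", "Gregory"),
  time1 = c(67, 80, 64),
  time2 = c(56, 90, 50),
  time3 = c(70, 67, 101)
)

# Gather the three time columns into a time/heartrate pair;
# names_prefix strips the leading "time" from the old column names
long <- wide %>%
  pivot_longer(cols = starts_with("time"),
               names_to = "time",
               names_prefix = "time",
               values_to = "heartrate")
```

Note that pivot_longer() leaves time as a character column; names_transform = as.integer would convert it during the pivot.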

Tidy data

Let’s make the data tidy.
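One way to tidy challenge_1, a sketch using tidyr's pivot_longer() (challenge_1 is re-created here so the chunk is self-contained), producing the output shown below:

```r
library(tidyverse)

challenge_1 <- tibble(
  issue_importance = c("Very Important", "Somewhat Important",
                       "Not very Important", "Unimportant"),
  `18-29` = c(.59, .28, .08, .05),
  `30-44` = c(.66, .27, .06, .02),
  `45-64` = c(.69, .28, .03, .01),
  `65+`   = c(.7, .27, .02, .01)
)

# Collapse the four age columns into an age/percent_agreement pair
challenge_1_tidy <- challenge_1 %>%
  pivot_longer(cols = -issue_importance,
               names_to = "age",
               values_to = "percent_agreement")

head(challenge_1_tidy)
```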

## # A tibble: 6 × 3
##   issue_importance   age   percent_agreement
##   <chr>              <chr>             <dbl>
## 1 Very Important     18-29              0.59
## 2 Very Important     30-44              0.66
## 3 Very Important     45-64              0.69
## 4 Very Important     65+                0.7 
## 5 Somewhat Important 18-29              0.28
## 6 Somewhat Important 30-44              0.27

The format here is now tidy. Each column represents a single variable. The separate age columns have been combined into a single column. Each row represents a single cross-sectional observation. Finally, each cell represents a single value.

The Tidyverse

The tidyverse is a core set of packages that many R programmers rely on for everyday data science tasks. We can load this set of packages all at once with:

library(tidyverse)

Alternatively, we can load individual packages from the tidyverse. For example, we can load the dplyr package.

library(dplyr)

The Grammar of Data Wrangling

The dplyr package provides many very useful functions for manipulating data frames. These functions will save you time by reducing repetition. As a bonus, you might even find the dplyr grammar easier to read.

In this workshop, we will cover 6 of the most commonly used functions and use pipes (%>%) to combine them.

  1. filter(): pick out observations based on their values
  2. select(): pick certain columns by name
  3. group_by(): group observations based on a variable
  4. arrange(): reorder observations based on a variable
  5. summarize(): summarize observations based on some function
  6. mutate(): create new variables as functions of existing variables

Each of these verbs works similarly. The first argument is always a data frame object. The additional arguments describe what to do with the data frame using unquoted variable names. The result is a new data frame.

An Example Workflow

Before diving into each function, let’s walk through an example data wrangling workflow that starts by reading in a dataset and ends with a dataset suitable for a descriptive analysis. This workflow uses every verb we will cover in this workshop.

Suppose we have a dataset of individuals’ occupations from different US states. In the data folder for this workshop repository, a dataset called “jobs” matches this description. We can read it in with read_csv(), which comes with the tidyverse. read_csv() works like base R’s read.csv() but is faster and returns a tibble.

The here() function builds file paths relative to the project’s top-level directory.

For a descriptive analysis, we’d like to know the average life expectancy for individuals who are and are not in agriculture within each West Coast state. To accomplish this task, we will need to filter our list of states, group by each state, create a variable denoting whether the individual works in agriculture, and summarize the average age of individuals who do and do not work in agriculture. Finally, we’d like to arrange the states in reverse alphabetical order.

What follows is an example of the workflow just described, in the spirit of teaching the whole game.

## Data import 
jobs <- read_csv(here("data/jobs.csv"))
## New names:
## * `` -> ...1
## Rows: 1000 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): State, Education, Employment, Marital, Political, Name, Occupation
## dbl (3): ...1, Age, Income
## lgl (1): Smokes
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Workflow 
jobs %>% 
  select(State, Occupation, Age)%>%
  filter(State %in% c("Alaska","California", "Oregon", "Washington"))%>%
  mutate(in_ag = if_else(Occupation == "Farmer", "Yes", "No"))%>%
  group_by(State, in_ag)%>%
  summarise(average_lifeExp = mean(Age))%>%
  arrange(desc(State))
## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.
## # A tibble: 8 × 3
## # Groups:   State [4]
##   State      in_ag average_lifeExp
##   <chr>      <chr>           <dbl>
## 1 Washington No               55  
## 2 Washington Yes              64.5
## 3 Oregon     No               49.3
## 4 Oregon     Yes              58  
## 5 California No               53.8
## 6 California Yes              53.6
## 7 Alaska     No               64  
## 8 Alaska     Yes              77
## We can also save the output of our workflow to a new data frame object
wc_lifeExp <- jobs %>% 
  select(State, Occupation, Age)%>%
  filter(State %in% c("Alaska","California", "Oregon", "Washington"))%>%
  mutate(in_ag = if_else(Occupation == "Farmer", "Yes", "No"))%>%
  group_by(State, in_ag)%>%
  summarise(average_lifeExp = mean(Age))%>%
  arrange(desc(State))
## `summarise()` has grouped output by 'State'. You can override using the `.groups` argument.

The way to think of the pipe (the %>% symbol at the end of each of these lines) is that it takes what is on its left side and makes it the first argument of the function on its right side.

The tidyverse is built around pipes because the first argument of every function in the tidyverse is a data frame, which means the whole tidyverse follows the same structure (or API). When doing data tasks, this consistency turns out to be very useful because it is easy to reason about what a function will do and what it needs.

Pipes

A pipe operator %>% or |> is a method of chaining together functions. A pipe takes what is on the left-hand side of the pipe and makes it the first argument of the function on the right-hand side by default.

## Pseudo code 
x %>% f(y) 

## is equivalent to 
f(x, y)

## where f() is a function 

## Example with the jobs data set 
jobs %>% 
  filter(State == "California")

## is equivalent to 
filter(jobs, State == "California")

As mentioned, the advantage of the tidyverse is that every function presumes that the first argument is a tidy data object. It is possible to put the left-hand side result as a different argument for the function on the right-hand side by explicitly specifying it. However, a well-written workflow tends to obviate the need to do so.
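For example, with the magrittr pipe a dot (.) marks where the left-hand side should go when it is not the first argument. A small sketch with made-up values:

```r
library(dplyr)  # re-exports the magrittr pipe %>%

# By default, the left-hand side becomes the first argument:
n_obs <- c(4, 8, 15) %>% length()        # length(c(4, 8, 15)) -> 3

# A dot (.) redirects it to another argument instead:
x <- 5 %>% seq(1, 10, length.out = .)    # seq(1, 10, length.out = 5)
x  # -> 1, 3.25, 5.5, 7.75, 10
```

With the native pipe |> (R >= 4.2), the placeholder is _ instead and must be passed to a named argument.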

London Animal Rescue Dataset

To show the benefits of the dplyr verbs, we will start by working with a dataset of animal rescue incidents by the London Fire Brigade from 2009-2021. The data provides information on each incident, including the location, date and time, number of fire trucks, and cost of the rescue.

animals <- read_csv(here("data/animalRescue.csv"))
## Rows: 8044 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): IncidentNumber, Ward, Borough, PumpCount, PumpHoursTotal, Type, Pro...
## dbl (2): Year, RescueCost
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(animals)
## Rows: 8,044
## Columns: 10
## $ IncidentNumber   <chr> "139091", "275091", "2075091", "2872091", "3553091", …
## $ Year             <dbl> 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009, 2009,…
## $ Ward             <chr> "Crystal Palace & Upper Norwood", "Woodside", "Carsha…
## $ Borough          <chr> "Croydon", "Croydon", "Sutton", "Hillingdon", "Haveri…
## $ PumpCount        <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"…
## $ PumpHoursTotal   <chr> "2", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1"…
## $ Type             <chr> "Dog", "Fox", "Dog", "Horse", "Rabbit", "Unknown", "D…
## $ PropertyCategory <chr> "Dwelling", "Outdoor Structure", "Outdoor Structure",…
## $ RescueType       <chr> "Other animal assistance", "Other animal assistance",…
## $ RescueCost       <dbl> 510, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255…

filter()

Suppose we are interested in looking at animal rescues from 2018-2020. filter() allows us to subset observations based on their values.

Since we are introducing a verb explicitly for the first time, we will show the verb with and without pipes. The API is the same for every verb.

## Get all observations from the Year 2015 
filter(animals, Year == 2015)
## # A tibble: 540 × 10
##    IncidentNumber  Year Ward        Borough       PumpCount PumpHoursTotal Type 
##    <chr>          <dbl> <chr>       <chr>         <chr>     <chr>          <chr>
##  1 304151          2015 Hampton     Richmond Upo… 1         1              Dog  
##  2 468151          2015 Chaucer     Southwark     1         1              Unkn…
##  3 1074151         2015 Town        Enfield       1         1              Cat  
##  4 1528151         2015 Kenley      Croydon       1         2              Dog  
##  5 2217151         2015 Willesden … Brent         1         1              Bird 
##  6 2270151         2015 Lea Bridge  Hackney       1         1              Cat  
##  7 2291151         2015 Falconwood… Bexley        1         1              Cat  
##  8 2815151         2015 Teddington  Richmond Upo… 1         1              Cat  
##  9 3211151         2015 New Adding… Croydon       1         1              Cat  
## 10 3409151         2015 Brunel      Hillingdon    1         1              Dog  
## # … with 530 more rows, and 3 more variables: PropertyCategory <chr>,
## #   RescueType <chr>, RescueCost <dbl>
## Get all observations from 2018-2020 
filter(animals, Year >= 2018 & Year <=2020)
## # A tibble: 1,968 × 10
##    IncidentNumber   Year Ward         Borough     PumpCount PumpHoursTotal Type 
##    <chr>           <dbl> <chr>        <chr>       <chr>     <chr>          <chr>
##  1 001058-03012018  2018 Carshalton … Sutton      1         2              Dog  
##  2 001474-04012018  2018 Stanley      Kensington… 1         1              Bird 
##  3 001495-04012018  2018 Dulwich Vil… Southwark   1         1              Bird 
##  4 002204-06012018  2018 Lea Bridge   Waltham Fo… 1         1              Dog  
##  5 002369-06012018  2018 St. Andrew's Havering    1         1              Bird 
##  6 002832-07012018  2018 Garden Subu… Barnet      1         1              Bird 
##  7 003760-09012018  2018 Northcote    Wandsworth  1         1              Bird 
##  8 004039-10012018  2018 Haverstock   Camden      1         1              Cat  
##  9 004051-10012018  2018 Ponders End  Enfield     1         3              Horse
## 10 004435-11012018  2018 Knight's Hi… Lambeth     1         1              Cat  
## # … with 1,958 more rows, and 3 more variables: PropertyCategory <chr>,
## #   RescueType <chr>, RescueCost <dbl>

The first argument is the name of the data frame we want to use, in this case, “animals.” The arguments after are the expressions that filter the data frame.

Here is the same code with a pipe, plus a neat shortcut:

animals %>% 
  filter(Year == 2015)
## # A tibble: 540 × 10
##    IncidentNumber  Year Ward        Borough       PumpCount PumpHoursTotal Type 
##    <chr>          <dbl> <chr>       <chr>         <chr>     <chr>          <chr>
##  1 304151          2015 Hampton     Richmond Upo… 1         1              Dog  
##  2 468151          2015 Chaucer     Southwark     1         1              Unkn…
##  3 1074151         2015 Town        Enfield       1         1              Cat  
##  4 1528151         2015 Kenley      Croydon       1         2              Dog  
##  5 2217151         2015 Willesden … Brent         1         1              Bird 
##  6 2270151         2015 Lea Bridge  Hackney       1         1              Cat  
##  7 2291151         2015 Falconwood… Bexley        1         1              Cat  
##  8 2815151         2015 Teddington  Richmond Upo… 1         1              Cat  
##  9 3211151         2015 New Adding… Croydon       1         1              Cat  
## 10 3409151         2015 Brunel      Hillingdon    1         1              Dog  
## # … with 530 more rows, and 3 more variables: PropertyCategory <chr>,
## #   RescueType <chr>, RescueCost <dbl>
animals %>% 
  filter(Year >= 2018 & Year <= 2020)
## # A tibble: 1,968 × 10
##    IncidentNumber   Year Ward         Borough     PumpCount PumpHoursTotal Type 
##    <chr>           <dbl> <chr>        <chr>       <chr>     <chr>          <chr>
##  1 001058-03012018  2018 Carshalton … Sutton      1         2              Dog  
##  2 001474-04012018  2018 Stanley      Kensington… 1         1              Bird 
##  3 001495-04012018  2018 Dulwich Vil… Southwark   1         1              Bird 
##  4 002204-06012018  2018 Lea Bridge   Waltham Fo… 1         1              Dog  
##  5 002369-06012018  2018 St. Andrew's Havering    1         1              Bird 
##  6 002832-07012018  2018 Garden Subu… Barnet      1         1              Bird 
##  7 003760-09012018  2018 Northcote    Wandsworth  1         1              Bird 
##  8 004039-10012018  2018 Haverstock   Camden      1         1              Cat  
##  9 004051-10012018  2018 Ponders End  Enfield     1         3              Horse
## 10 004435-11012018  2018 Knight's Hi… Lambeth     1         1              Cat  
## # … with 1,958 more rows, and 3 more variables: PropertyCategory <chr>,
## #   RescueType <chr>, RescueCost <dbl>
## which is the same as 
animals %>% 
  filter(between(Year, 2018, 2020))
## # A tibble: 1,968 × 10
##    IncidentNumber   Year Ward         Borough     PumpCount PumpHoursTotal Type 
##    <chr>           <dbl> <chr>        <chr>       <chr>     <chr>          <chr>
##  1 001058-03012018  2018 Carshalton … Sutton      1         2              Dog  
##  2 001474-04012018  2018 Stanley      Kensington… 1         1              Bird 
##  3 001495-04012018  2018 Dulwich Vil… Southwark   1         1              Bird 
##  4 002204-06012018  2018 Lea Bridge   Waltham Fo… 1         1              Dog  
##  5 002369-06012018  2018 St. Andrew's Havering    1         1              Bird 
##  6 002832-07012018  2018 Garden Subu… Barnet      1         1              Bird 
##  7 003760-09012018  2018 Northcote    Wandsworth  1         1              Bird 
##  8 004039-10012018  2018 Haverstock   Camden      1         1              Cat  
##  9 004051-10012018  2018 Ponders End  Enfield     1         3              Horse
## 10 004435-11012018  2018 Knight's Hi… Lambeth     1         1              Cat  
## # … with 1,958 more rows, and 3 more variables: PropertyCategory <chr>,
## #   RescueType <chr>, RescueCost <dbl>

As with the examples we have seen before, we take the animals data frame and then pass it via the pipe to the filter argument. Because the pipe puts the data frame as the first argument for every tidyverse function, we only have to specify the subsequent arguments. between() is a dplyr function that is a shortcut for x >= left & x <= right where x is our variable of interest.

You can extend filter() with logical conditions:

Symbol                         Meaning
<                              less than
>                              greater than
==                             equal to
<=                             less than or equal to
>=                             greater than or equal to
!=                             not equal to
%in%                           group membership
is.na()                        is NA
!is.na()                       is not NA
&, |, !, xor(), any(), all()   Boolean operators
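A few of these conditions in action on a small made-up data frame (hypothetical values, not the rescue data):

```r
library(dplyr)

toy <- tibble(
  Type = c("Cat", "Dog", "Fox", NA),
  Year = c(2015, 2018, 2020, 2019)
)

# %in% tests group membership (NA is not a member, so that row drops out)
cats_dogs <- toy %>% filter(Type %in% c("Cat", "Dog"))

# !is.na() keeps rows where Type is not missing; the comma acts as AND
recent_known <- toy %>% filter(!is.na(Type), Year >= 2018)
```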

Exercise 1

Create a new data frame object called cats_and_dogs that filters the animals data frame for all observations for which the animal type is a cat or a dog. Use the pipe operator to chain the functions together.

Multiple Filters at once

Sometimes we want to filter on multiple conditions at once. For example, we might be interested in all rescues of birds or horses after 2011. We use a comma (,) between expressions in filter() to accomplish this; the comma acts as a logical AND.

birds_and_horses <- animals %>% 
  filter(Type == "Bird"|Type == "Horse", Year > 2011)
glimpse(birds_and_horses)
## Rows: 1,487
## Columns: 10
## $ IncidentNumber   <chr> "48122", "1454122", "2677122", "5808121", "6451121", …
## $ Year             <dbl> 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012, 2012,…
## $ Ward             <chr> "Wandle Valley", "Noel Park", "Winchmore Hill", "Clis…
## $ Borough          <chr> "Sutton", "Haringey", "Enfield", "Hackney", "Bexley",…
## $ PumpCount        <chr> "1", "1", "2", "1", "1", "1", "1", "1", "1", "1", "1"…
## $ PumpHoursTotal   <chr> "2", "1", "3", "1", "1", "1", "1", "1", "1", "1", "3"…
## $ Type             <chr> "Horse", "Bird", "Bird", "Bird", "Horse", "Bird", "Bi…
## $ PropertyCategory <chr> "Outdoor", "Non Residential", "Outdoor", "Dwelling", …
## $ RescueType       <chr> "Other animal assistance", "Animal rescue from height…
## $ RescueCost       <dbl> 520, 260, 780, 260, 260, 260, 260, 260, 260, 260, 780…

We can put multiple filter functions in a row in a piped workflow, though it does not look as readable.

birds_and_horses2 <- animals %>% 
  filter(Type == "Bird"|Type == "Horse")%>%
  filter(Year > 2011)

## These two data frames are the same 
all.equal(birds_and_horses, birds_and_horses2)
## [1] TRUE

select()

Now we have a data frame of just rescues of cats and dogs. Suppose we are only interested in a few variables of the data frame. We can use the select() function to keep only the variables we select.

To be concrete, suppose we just want to see the Year, Type, and PropertyCategory variables.

cats_and_dogs <- animals %>% 
  filter(Type == "Cat"|Type == "Dog")

cats_and_dogs_short <- cats_and_dogs %>% 
  select(Year, Type, PropertyCategory)

If we open up cats_and_dogs_short, we’ll see that it only contains the Year, Type, and PropertyCategory columns.

Both the select() and filter() functions subset the data frame. The difference is that select() extracts certain columns, while filter() extracts certain rows.

When using select() and filter() together, the order of operations matters. If we use select() first, filter() can only use the variables that we selected.
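A quick illustration of that ordering point with made-up data:

```r
library(dplyr)

toy <- tibble(Type = c("Cat", "Dog"), Year = c(2015, 2018))

# Works: Year is still present when filter() runs
kept <- toy %>%
  filter(Year > 2016) %>%
  select(Type)

# Errors: Year was already dropped by select(), so filter() cannot see it
# toy %>% select(Type) %>% filter(Year > 2016)
```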

We can also select variables using:

  - variable indices
  - variable names (without quotes)
  - x:z to select all variables between x and z
  - -y to exclude y
  - starts_with(x, ignore.case = TRUE): all names that start with x
  - ends_with(x, ignore.case = TRUE): all names that end with x
  - contains(x, ignore.case = TRUE): all names that contain x
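Some of these selection styles on a small made-up data frame:

```r
library(dplyr)

toy <- tibble(id = 1:3, year = 2001:2003,
              pump_count = c(1, 1, 2), pump_hours = c(2, 1, 3))

by_range  <- toy %>% select(id:pump_count)        # a range of columns
drop_year <- toy %>% select(-year)                # everything except year
pumps     <- toy %>% select(starts_with("pump"))  # names starting with "pump"
```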

Split-Apply-Combine

A common task you’ll encounter when working with data is running calculations on different groups within the data. For instance, what if we wanted to calculate the total number of cat rescues for each borough?

The abstract problem we’re encountering here is known as “split-apply-combine”:

We want to split our data into groups (in this case boroughs), apply some calculations on that group (in this case count), and then combine the results together afterward. This pattern is common in data wrangling and a cornerstone of the dplyr approach.

group_by()

A grouped_df can be thought of as a list where each item is a data.frame containing only the rows that correspond to one particular value of the grouping variable (in the example below, Borough).

Let’s demonstrate the benefits of group_by() by answering the question, “What is the total number of cat rescues for each borough?”

borough_cat_rescues <- animals %>% 
  ## group all observations by Boroughs
  group_by(Borough)%>%
  ## get just the observations of cat rescues 
  filter(Type == "Cat")%>%
  ## new function! count the unique values of one or more variables
  count(Type)

head(borough_cat_rescues)
## # A tibble: 6 × 3
## # Groups:   Borough [6]
##   Borough              Type      n
##   <chr>                <chr> <int>
## 1 Barking And Dagenham Cat     105
## 2 Barnet               Cat     127
## 3 Bexley               Cat      84
## 4 Brent                Cat     114
## 5 Bromley              Cat     112
## 6 Camden               Cat     137

To remove the grouping, we pipe our data into ungroup().

animals %>% 
  group_by(Borough)%>%
  filter(Type == "Cat")%>%
  count(Type)%>%
  ungroup()
## # A tibble: 32 × 3
##    Borough              Type      n
##    <chr>                <chr> <int>
##  1 Barking And Dagenham Cat     105
##  2 Barnet               Cat     127
##  3 Bexley               Cat      84
##  4 Brent                Cat     114
##  5 Bromley              Cat     112
##  6 Camden               Cat     137
##  7 Croydon              Cat     161
##  8 Ealing               Cat     131
##  9 Enfield              Cat     146
## 10 Greenwich            Cat     132
## # … with 22 more rows

arrange()

We now have cat rescues by borough, but they are listed alphabetically rather than by count. To order the observations from smallest to largest, we can use arrange() and pass it the variable of interest.

borough_cat_rescues %>%
  arrange(n)
## # A tibble: 32 × 3
## # Groups:   Borough [32]
##    Borough                Type      n
##    <chr>                  <chr> <int>
##  1 Kingston Upon Thames   Cat      58
##  2 Harrow                 Cat      66
##  3 Merton                 Cat      69
##  4 Havering               Cat      72
##  5 Sutton                 Cat      79
##  6 Bexley                 Cat      84
##  7 Richmond Upon Thames   Cat      86
##  8 Hillingdon             Cat      87
##  9 Hounslow               Cat      93
## 10 Hammersmith And Fulham Cat     103
## # … with 22 more rows

If we want to reverse the order, we wrap the variable name in desc(), like so:

borough_cat_rescues %>%
  arrange(desc(n))
## # A tibble: 32 × 3
## # Groups:   Borough [32]
##    Borough       Type      n
##    <chr>         <chr> <int>
##  1 Newham        Cat     190
##  2 Haringey      Cat     185
##  3 Lambeth       Cat     175
##  4 Southwark     Cat     174
##  5 Hackney       Cat     162
##  6 Croydon       Cat     161
##  7 Tower Hamlets Cat     155
##  8 Lewisham      Cat     153
##  9 Enfield       Cat     146
## 10 Wandsworth    Cat     139
## # … with 22 more rows

Exercise 2

Let’s put the verbs we’ve learned so far into action all at once. Answer the question “What is the total number of dog rescues for each borough?” by making a data frame called borough_dog_rescues and arranging the rows by borough in alphabetical order.

## Your Code here 

summarize()

summarise() or summarize() collapses a data frame to a single row.

animals %>% 
  summarise(avg_rescue_cost = mean(RescueCost, na.rm = T))
## # A tibble: 1 × 1
##   avg_rescue_cost
##             <dbl>
## 1            358.
## Note that summarise can also be written with a z 
animals %>% 
  summarize(avg_rescue_cost = mean(RescueCost, na.rm = T))
## # A tibble: 1 × 1
##   avg_rescue_cost
##             <dbl>
## 1            358.

summarise() becomes extremely useful when paired with group_by(). Our last block gave the average cost across the whole data set. By using group_by(), we split the original data frame into multiple pieces, and then run functions (e.g. mean()) within summarize() on each piece.

Suppose instead we are interested in the average cost of a rescue for each borough.

animals %>% 
  group_by(Borough)%>%
  summarise(avg_rescue_cost = mean(RescueCost, na.rm=T))%>%
  arrange(desc(avg_rescue_cost))
## # A tibble: 37 × 2
##    Borough        avg_rescue_cost
##    <chr>                    <dbl>
##  1 Epping Forest             650.
##  2 Bexley                    431.
##  3 Bromley                   404.
##  4 Enfield                   399.
##  5 Havering                  389.
##  6 Hillingdon                386.
##  7 Tower Hamlets             381.
##  8 Waltham Forest            379.
##  9 Hounslow                  377.
## 10 Camden                    374.
## # … with 27 more rows

We are not limited to grouping by just one variable; we can group by multiple variables. Let’s answer the question, “What is the average rescue cost for each borough and year?”

animals %>% 
  group_by(Borough, Year)%>%
  summarise(avg_rescue_cost = mean(RescueCost, na.rm=T))%>%
  arrange(Borough)
## `summarise()` has grouped output by 'Borough'. You can override using the `.groups` argument.
## # A tibble: 431 × 3
## # Groups:   Borough [37]
##    Borough               Year avg_rescue_cost
##    <chr>                <dbl>           <dbl>
##  1 Barking And Dagenham  2009            279.
##  2 Barking And Dagenham  2010            312 
##  3 Barking And Dagenham  2011            329.
##  4 Barking And Dagenham  2012            260 
##  5 Barking And Dagenham  2013            303.
##  6 Barking And Dagenham  2014            310.
##  7 Barking And Dagenham  2015            340.
##  8 Barking And Dagenham  2016            359.
##  9 Barking And Dagenham  2017            426.
## 10 Barking And Dagenham  2018            333.
## # … with 421 more rows

That is already quite powerful, but it gets even better! We are not limited to defining one new variable in summarize(). Suppose we also want the standard deviation of rescue costs within each borough and year.

animals %>% 
  group_by(Borough, Year)%>%
  summarise(avg_rescue_cost = mean(RescueCost, na.rm=T),
            sd_rescue_cost = sd(RescueCost, na.rm = T))%>%
  arrange(Borough)
## `summarise()` has grouped output by 'Borough'. You can override using the `.groups` argument.
## # A tibble: 431 × 4
## # Groups:   Borough [37]
##    Borough               Year avg_rescue_cost sd_rescue_cost
##    <chr>                <dbl>           <dbl>          <dbl>
##  1 Barking And Dagenham  2009            279.         104.  
##  2 Barking And Dagenham  2010            312          201.  
##  3 Barking And Dagenham  2011            329.         208.  
##  4 Barking And Dagenham  2012            260            0   
##  5 Barking And Dagenham  2013            303.          77.5 
##  6 Barking And Dagenham  2014            310.          67.9 
##  7 Barking And Dagenham  2015            340.         107.  
##  8 Barking And Dagenham  2016            359.         103.  
##  9 Barking And Dagenham  2017            426.         220.  
## 10 Barking And Dagenham  2018            333.           1.39
## # … with 421 more rows

You will receive a message that says “summarise() has grouped output by ‘Borough’. You can override using the .groups argument.” This message relates to an experimental feature in dplyr and does not change the results of our computation at all. An identical computation that avoids the message is the following:

animals %>% 
  group_by(Borough, Year)%>%
  summarise(avg_rescue_cost = mean(RescueCost, na.rm=T),
            sd_rescue_cost = sd(RescueCost, na.rm = T),
            .groups = "drop_last")%>%
  arrange(Borough)
## # A tibble: 431 × 4
## # Groups:   Borough [37]
##    Borough               Year avg_rescue_cost sd_rescue_cost
##    <chr>                <dbl>           <dbl>          <dbl>
##  1 Barking And Dagenham  2009            279.         104.  
##  2 Barking And Dagenham  2010            312          201.  
##  3 Barking And Dagenham  2011            329.         208.  
##  4 Barking And Dagenham  2012            260            0   
##  5 Barking And Dagenham  2013            303.          77.5 
##  6 Barking And Dagenham  2014            310.          67.9 
##  7 Barking And Dagenham  2015            340.         107.  
##  8 Barking And Dagenham  2016            359.         103.  
##  9 Barking And Dagenham  2017            426.         220.  
## 10 Barking And Dagenham  2018            333.           1.39
## # … with 421 more rows

Exercise 3

Create a data frame that contains the average rescue cost, median rescue cost, and standard deviation of rescue costs for bird rescues for each property category and year. Save the result of your workflow into a data frame object called bird_stats.

mutate()

mutate() adds new variables to the data frame that you pass into it.

To demonstrate mutate(), load gapminder-FiveYearData.csv into RStudio as a data frame object called gapminder:

gapminder <- read_csv(here("data/gapminder-FiveYearData.csv"))
## Rows: 1704 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, continent
## dbl (4): year, pop, lifeExp, gdpPercap
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country   <chr> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ year      <dbl> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ pop       <dbl> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ continent <chr> "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Asi…
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …

The gapminder dataset has six variables and is in tidy form. One of those variables is lifeExp. Imagine we are interested in comparing the difference between a country’s life expectancy and the average continent-year-wide life expectancy by year. Such a query requires us to make two new variables. The first variable is the mean continent-year life expectancy. The second variable is the difference between life expectancy for a country and the continent’s life expectancy for every year.

mutate() easily handles both cases. We can also use mutate() to create new variables before (or even after) summarizing information.

gapminder_xtra_vars <- gapminder %>% 
  group_by(year, continent) %>%
  mutate(continent_lifeExp = mean(lifeExp, na.rm = TRUE),
         diff_lifeExp = lifeExp - continent_lifeExp)

Note that mutate() does not require a group_by(). We can simply add a new column, either as a function of existing columns or through a new computation. Here’s an example of doing both.

gap_mutate_no_group <- gapminder %>% 
  mutate(gdp_billion = (gdpPercap*pop)/10^9, 
         new_comp = 2+2)
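One convenience worth knowing: within a single mutate() call, a later column can refer to a column created earlier in the same call. A minimal sketch on a toy tibble (the toy data and values are made up for illustration):

```r
library(dplyr)

# Hypothetical toy data for illustration
toy <- tibble(pop = c(1e6, 2e6), gdpPercap = c(1000, 2000))

# gdp is created first and then used immediately by gdp_billion
toy2 <- toy %>%
  mutate(gdp = gdpPercap * pop,
         gdp_billion = gdp / 10^9)
```

This avoids repeating the gdpPercap * pop computation when several new columns build on the same intermediate quantity.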

A final useful trick is counting observations within a mutate(). dplyr provides the handy n() function for this computation. Here’s an example.

gap_count <- gapminder %>% 
  filter(year == 2007) %>%
  group_by(continent) %>%
  mutate(numCountries = n()) %>%
  # distinct() keeps only the unique combinations
  # of the selected columns
  distinct(continent, numCountries) %>%
  arrange(desc(numCountries))
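As an aside, dplyr’s count() condenses the group_by() + mutate(n()) + distinct() + arrange() pattern above into a single call; its name argument (available in recent dplyr versions) sets the name of the count column. A minimal sketch on a toy stand-in for the gapminder data (the values are made up):

```r
library(dplyr)

# Hypothetical stand-in for the gapminder data
toy <- tibble(
  continent = c("Asia", "Asia", "Europe"),
  country = c("A", "B", "C"),
  year = c(2007, 2007, 2007)
)

# One call replaces group_by() + mutate(n()) + distinct() + arrange()
toy_count <- toy %>%
  filter(year == 2007) %>%
  count(continent, sort = TRUE, name = "numCountries")
```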

Exercise 4

Add a column to the gapminder dataset that contains the total population of the observation’s continent in that year. For example, if the first observation is Afghanistan in 1952, the new column would contain the population of Asia in 1952.

Exercise 5

Use dplyr to: (a) add a column called gdpPercap_diff that contains the difference between the observation’s gdpPercap and the mean gdpPercap of its continent in that year, and (b) arrange the data frame by that new column in descending order (so that the relatively richest country-years are listed first).

pivot_longer()

Until now, we have been using the nicely formatted, original gapminder dataset. Now, let’s start with the wide-format version of the gapminder dataset.

gap_wide <- read_csv(here("data/gapminder_wide.csv"))
## Rows: 142 Columns: 38
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): continent, country
## dbl (36): gdpPercap_1952, gdpPercap_1957, gdpPercap_1962, gdpPercap_1967, gd...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(gap_wide)
## # A tibble: 6 × 38
##   continent country  gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967
##   <chr>     <chr>             <dbl>          <dbl>          <dbl>          <dbl>
## 1 Africa    Algeria           2449.          3014.          2551.          3247.
## 2 Africa    Angola            3521.          3828.          4269.          5523.
## 3 Africa    Benin             1063.           960.           949.          1036.
## 4 Africa    Botswana           851.           918.           984.          1215.
## 5 Africa    Burkina…           543.           617.           723.           795.
## 6 Africa    Burundi            339.           380.           355.           413.
## # … with 32 more variables: gdpPercap_1972 <dbl>, gdpPercap_1977 <dbl>,
## #   gdpPercap_1982 <dbl>, gdpPercap_1987 <dbl>, gdpPercap_1992 <dbl>,
## #   gdpPercap_1997 <dbl>, gdpPercap_2002 <dbl>, gdpPercap_2007 <dbl>,
## #   lifeExp_1952 <dbl>, lifeExp_1957 <dbl>, lifeExp_1962 <dbl>,
## #   lifeExp_1967 <dbl>, lifeExp_1972 <dbl>, lifeExp_1977 <dbl>,
## #   lifeExp_1982 <dbl>, lifeExp_1987 <dbl>, lifeExp_1992 <dbl>,
## #   lifeExp_1997 <dbl>, lifeExp_2002 <dbl>, lifeExp_2007 <dbl>, …

The first step towards getting our nice, tidy data format is to convert from the wide to the long format.

The function pivot_longer() will “gather” the observation variables into a single variable.

gap_long <- gap_wide %>% 
  pivot_longer(
    cols = 3:38, 
    names_to = "obstype_year",
    values_to = "obs_values"
  )
head(gap_long)
## # A tibble: 6 × 4
##   continent country obstype_year   obs_values
##   <chr>     <chr>   <chr>               <dbl>
## 1 Africa    Algeria gdpPercap_1952      2449.
## 2 Africa    Algeria gdpPercap_1957      3014.
## 3 Africa    Algeria gdpPercap_1962      2551.
## 4 Africa    Algeria gdpPercap_1967      3247.
## 5 Africa    Algeria gdpPercap_1972      4183.
## 6 Africa    Algeria gdpPercap_1977      4910.

Notice that we put three arguments into the pivot_longer() function:

  1. the columns to gather (cols = 3:38, signaling columns 3 through 38). Notice that we keep columns 1 and 2 out of the gather, as these are considered “ID” variables.

  2. the name of the new ID variable that will hold the old column names (names_to = "obstype_year"),

  3. the name of the new amalgamated observation variable (values_to = "obs_values").

An alternative to column indices is to type in the column names of interest explicitly. Here is an example where we exclude the first two columns.

makeLong <- gap_wide %>% 
  pivot_longer(
    cols = c(-continent, -country), 
    names_to = "obstype_year",
    values_to = "obs_values"
  )
head(makeLong)
## # A tibble: 6 × 4
##   continent country obstype_year   obs_values
##   <chr>     <chr>   <chr>               <dbl>
## 1 Africa    Algeria gdpPercap_1952      2449.
## 2 Africa    Algeria gdpPercap_1957      3014.
## 3 Africa    Algeria gdpPercap_1962      2551.
## 4 Africa    Algeria gdpPercap_1967      3247.
## 5 Africa    Algeria gdpPercap_1972      4183.
## 6 Africa    Algeria gdpPercap_1977      4910.

Alternatively, we can use the column naming structure in the gap_wide data frame to do the same thing with the starts_with() function.

# with the starts_with() function
gap_long <- gap_wide %>%
  pivot_longer(
    cols = c(starts_with('gdpPercap'), 
             starts_with('lifeExp'),
             starts_with('pop')),
    names_to = "obstype_year",
    values_to = "obs_values"
  )
head(gap_long)
## # A tibble: 6 × 4
##   continent country obstype_year   obs_values
##   <chr>     <chr>   <chr>               <dbl>
## 1 Africa    Algeria gdpPercap_1952      2449.
## 2 Africa    Algeria gdpPercap_1957      3014.
## 3 Africa    Algeria gdpPercap_1962      2551.
## 4 Africa    Algeria gdpPercap_1967      3247.
## 5 Africa    Algeria gdpPercap_1972      4183.
## 6 Africa    Algeria gdpPercap_1977      4910.

However you choose to do it, notice that the output collapses all of the measure variables into two columns: one containing the new ID variable, the other containing the observation value for that row.

separate()

You’ll notice that in our long dataset, obstype_year actually contains two pieces of information, the observation type (pop, lifeExp, or gdpPercap) and the year.

We can use the separate() function to split the character strings into multiple variables:

gap_long_sep <- gap_long %>% 
  separate(obstype_year, into = c('obs_type','year'), sep = "_") %>% 
  mutate(year = as.integer(year))
head(gap_long_sep)
## # A tibble: 6 × 5
##   continent country obs_type   year obs_values
##   <chr>     <chr>   <chr>     <int>      <dbl>
## 1 Africa    Algeria gdpPercap  1952      2449.
## 2 Africa    Algeria gdpPercap  1957      3014.
## 3 Africa    Algeria gdpPercap  1962      2551.
## 4 Africa    Algeria gdpPercap  1967      3247.
## 5 Africa    Algeria gdpPercap  1972      4183.
## 6 Africa    Algeria gdpPercap  1977      4910.
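As an aside, recent versions of tidyr can do the pivot and the split in one step: giving names_to two names together with a names_sep tells pivot_longer() to split the old column names while gathering. A minimal sketch on a toy stand-in for gap_wide (the values are made up):

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for gap_wide
toy_wide <- tibble(
  continent = "Africa",
  country = "Algeria",
  gdpPercap_1952 = 2449,
  lifeExp_1952 = 43.1
)

# names_sep = "_" splits e.g. "gdpPercap_1952" into obs_type and year
toy_long <- toy_wide %>%
  pivot_longer(
    cols = c(-continent, -country),
    names_to = c("obs_type", "year"),
    names_sep = "_",
    values_to = "obs_values"
  ) %>%
  mutate(year = as.integer(year))
```

This should give the same shape of result as the pivot_longer() + separate() pipeline above.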

pivot_wider()

The opposite of pivot_longer() is pivot_wider(). It spreads our observation variables back out to make a wider table. We can use this function to spread gap_long_sep back into the original gapminder format.

gap_original <- gap_long_sep %>% 
  pivot_wider(
    names_from = obs_type, 
    values_from = obs_values)
head(gap_original)
## # A tibble: 6 × 6
##   continent country  year gdpPercap lifeExp      pop
##   <chr>     <chr>   <int>     <dbl>   <dbl>    <dbl>
## 1 Africa    Algeria  1952     2449.    43.1  9279525
## 2 Africa    Algeria  1957     3014.    45.7 10270856
## 3 Africa    Algeria  1962     2551.    48.3 11000948
## 4 Africa    Algeria  1967     3247.    51.4 12760499
## 5 Africa    Algeria  1972     4183.    54.5 14760787
## 6 Africa    Algeria  1977     4910.    58.0 17152804

Exercise 6

  1. We have provided the code below to answer Exercise 5. Ungroup the results, then subset them to select only the country, year, and gdpPercap_diff columns. Put it in a wide format so that countries are rows and years are columns.

  2. Now, turn the data frame above back into the long format with three columns: country, year, and gdpPercap_diff.